Ngram Search Engine
Abstract
In this paper, we describe an idea, and its implementation, for an ngram search engine over very large sets of ngrams. The engine supports queries with an arbitrary number of wildcards. A search takes a fraction of a second and returns the fillers of the wildcards. We implemented the system on two datasets: the 1 billion 5-grams provided by Google (Web 1T data), and a set of 119 million 9-grams created from 82 years of newspaper text. The system runs on a single Linux PC with a modest amount of memory (less than 4GB) and disk space (less than 400GB). It can be a useful tool for knowledge discovery and other NLP tasks.
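To make the query model concrete, the following is a minimal in-memory sketch of wildcard ngram search that returns the wildcard fillers. It is not the implementation described in the paper, whose index layout is not given here. It assumes a positional inverted index, a `*` wildcard symbol, and toy data, and the names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict

WILDCARD = "*"  # assumed wildcard symbol; the paper's query syntax may differ


def build_index(ngrams):
    """Map (position, token) to the set of ids of ngrams holding that token there."""
    index = defaultdict(set)
    for ngram_id, (tokens, _count) in enumerate(ngrams):
        for pos, tok in enumerate(tokens):
            index[(pos, tok)].add(ngram_id)
    return index


def search(ngrams, index, query):
    """Return (fillers, count) pairs for ngrams matching the query, most frequent first."""
    fixed = [(pos, tok) for pos, tok in enumerate(query) if tok != WILDCARD]
    wild_positions = [pos for pos, tok in enumerate(query) if tok == WILDCARD]

    if fixed:
        # Intersect the posting sets of all fixed tokens, smallest set first.
        postings = sorted((index.get(key, set()) for key in fixed), key=len)
        candidates = set.intersection(*postings)
    else:
        # A query made only of wildcards matches every ngram of that length.
        candidates = range(len(ngrams))

    results = []
    for ngram_id in candidates:
        tokens, count = ngrams[ngram_id]
        if len(tokens) != len(query):
            continue
        fillers = tuple(tokens[pos] for pos in wild_positions)
        results.append((fillers, count))

    results.sort(key=lambda item: (-item[1], item[0]))
    return results


# Toy data (invented for illustration): (tokens, frequency) pairs.
NGRAMS = [
    (("a", "cup", "of", "tea"), 120),
    (("a", "cup", "of", "coffee"), 300),
    (("a", "glass", "of", "water"), 250),
    (("the", "cup", "of", "coffee"), 40),
]
INDEX = build_index(NGRAMS)

for fillers, count in search(NGRAMS, INDEX, ("a", "*", "of", "*")):
    print(fillers, count)
```

At the scale reported in the abstract (a billion 5-grams, under 4GB of memory), the posting lists would have to live on disk in a compact form. The sets used here only illustrate the matching logic.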
Similar Resources
A Proposal for Enhancement of Elasticsearch by mitigating n-Gram Indexing
Searching is one of the most important activities in the world of the Internet. Whenever one looks for any information on the World-Wide Web (WWW), the very first activity performed is searching. As the amount of data on the World-Wide Web (WWW) is increasing at a very fast rate, it is becoming very difficult to derive useful information from it. It allows every ordinary user to publish data that can be r...
Peachnote: Music Score Search and Analysis Platform
Hundreds of thousands of music scores are being digitized by libraries all over the world. In contrast to books, they generally remain inaccessible for content-based retrieval and algorithmic analysis. There is no analogue to Google Books for music scores, and there exist no large corpora of symbolic music data that would empower musicology in the way large text corpora are empowering computati...
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
We developed a search tool for ngrams extracted from a very large corpus (the current system uses the entire Wikipedia, which has 1.7 billion tokens). The tool supports queries with an arbitrary number of wildcards and/or specification by a combination of token, POS, chunk (such as NP, VP, PP) and Named Entity (NE). It outputs the matched ngrams with their frequencies as well as all the context...
Introducing Linggle: From Concordance to Linguistic Search Engine
We introduce a Web-scale linguistic search engine, Linggle, that retrieves lexical bundles in response to a given query. Unlike a typical concordance, Linggle accepts queries with keywords, wildcards, wild parts of speech (PoS), synonymous words, and additional regular expression (RE) operators, and returns bundles with frequency counts. In our approach, we augment the Google Web 1T corpus with inv...
Linggle: a Web-scale Linguistic Search Engine for Words in Context
In this paper, we introduce a Web-scale linguistics search engine, Linggle, that retrieves lexical bundles in response to a given query. The query might contain keywords, wildcards, wild parts of speech (PoS), synonyms, and additional regular expression (RE) operators. In our approach, we incorporate inverted file indexing, PoS information from BNC, and semantic indexing based on Latent Dirichl...